Inverse design of proteins with hydrophobic and polar amino acids. 
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A two amino acid (hydrophobic and polar) scheme is used to perform the design on target 
conformations corresponding to the native states of twenty single chain proteins. Strikingly, the 
percentage of successful identification of the nature of the residues benchmarked against naturally 
occurring proteins and their homologues is around 75 % independent of the complexity of the 
design procedure. Typically, the lowest success rate occurs for residues such as alanine that have 
a high secondary structure functionality. Using a simple lattice model, we argue that one possible 
shortcoming of the model studied may involve the coarse- graining of the twenty kinds of amino acids 
into just two effective types. 

Keywords: proteins, inverse design, negative design, numerical optimization 

PACS-classification: 87.10.+e, 87.15By 

I. INTRODUCTION 

The gigantic efforts spent by the scientific community over the past decades have unravelled many of the mysteries 
lying behind the chemistry and biological functionality of proteins ^, |^. However, two fundamental questions remain 
unanswered: 

1. what are the mechanisms that guide the folding of a string of amino acids into a complicated three-dimensional 
structure of hehces, loops and /3-sheets, 

2. how does one identify the sequence of amino acids that folds into a pre-assigned native structure. 

This study, building on the pioneering work of Shakhnovich and Gutin n9 and Sun et al. pO|, deals with the second 
issue which is commonly referred to as the inverse folding problem 0, |, ^, |2l|. This problem 

is central in many areas of biology and medicine; indeed the possibility of designing artificial proteins would open 
far-reaching pharmaceutical applications. 

The complexity of the problem is enormous because, in principle, it entails an exhaustive comparison of the native 
states of all sequences in search for the one(s) matching the desired target structure |5[ ^ 

This procedure has been recently formulated into a general mathematical form appropriate for numerical imple- 
mentation |l^ which shows that solving the design problem for a structure, F, amounts to the identification of the 
amino acid sequence, S, that maximizes the occupation probability, Py{S), 



where /3 is the Boltzmann weight, Er{S) is the energy of the sequence S over the structure F and the sum in the 
denominator is taken over all possible structures, {F'} having the same length of F. The winning sequence will 
maximize Pr{S) at all temperatures below the folding transition temperature (where the occupation probability of 
the native state is macroscopic). 

The previous studies of |l^ and Sun et al. consisted of finding a sequence that has the lowest possible energy in 
the putative native state. This is equivalent to the assumption that F{S) in the above equation is independent of S. 
The latter study used a modified Hamiltonian that included a chemical potential term that controlled the number of 
hydrophobic residues and may be interpreted as a first approximation for the inclusion of an F{S) term (see later). 

Two main obstacles need to be overcome to implement (|l|). The first is that one needs to know how to calculate 
Er{S) and F{S); second, it is necessary to explore the whole space of sequences to find the one maximizing (||). 

A commonly used simplification of the latter problem is effected on coarse graining the 20 types of naturally 
occurring amino acids into just two dominant classes: hydrophobic (H) and polar, (P). The evidence in favour of this 
subdivision is considerable |[ ^ 
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In fact, it has been shown that, to a large extent, the folding of proteins is driven by the collapse of H residues into 
a compact hydrophobic core surrounded by polar amino acids or solvent molecules ^, 0, ||. On top of that it appears 
possible to take a protein and exchange some of its amino acids within the H (or P) class without changing its native 
structure notably 0. 

In this paper we will adopt this point of view and perform a design on "real structures" (i.e., structures extracted 
from the Protein Data Bank) within the HP framework. We will address the difficulty of maximizing (|l|) by considering 
a series of approximate forms for Er{S) and F(S) of increasing complexity and refinement. For each design attempt, 
we present a detailed summary of the failure rates of identifying the correct class of each amino acid. It will appear 
that the lowest success is reached for residues with a high secondary structure functionality (especially alanine). We 
find that, though the failure rates on individual amino acids varies significantly over the various design attempts, the 
overall design success remains nearly constant. This is possibly suggestive of a limitation of the HP coarse-graining; 
in support of this we will present evidence showing that the the HP coarse-graining hinders the success of design 
strategies in a solvable protein model. 



II. THE DESIGN ALGORITHMS 



To test the design algorithm we first chose a set of 20 single chain proteins from the protein data bank (PDB); here 
and below the proteins will be identified with their index number as in Table |. 

In order to perform the HP design, we begin by substituting each amino acid unit with a fictitious residue placed 
at a distance of 3 A from the protein backbone along the — Cp direction of the true residue, following Sun et al. 

For GLY, which does not have a /3-Carbon, the fictitious residue coincided with the a-Carbon Following this 
procedure, one obtains the bare backbone of the original protein stripped from any information that could distinguish 
different types of amino acids. 

The goal of the design procedure was to identify the polarity or hydrophobicity of fictitious residues exactly as in the 
true sequence, where the 20 aminoacids were coarse-grained as in Table g. 

From table |l| it can be calculated that the fraction of H residues is about 41%. If one were to use a totally random 
method for guessing the correct class of residues, while respecting the H/P ratio r = 0.41/0.59 one would obtain a 
relative success of 

only slightly higher than without constraining r. This result is to be borne in mind when assessing the performance 
of the design procedures presented in the remainder of this section. 



A. Method 1 



The coarsest design attempt that can be tried is to establish the polarity of a residue according to the number of 
neighbours in contact with it. Customarily two non-consecutive residues, i and j, are said to be in contact if their 
distance, , falls within a range of around 7 A and no other residue is between them. To each residue, i, in the target 
structure we assigned a contact score according to the rule 

3 

which weights the strength of contact interactions with a smooth sigmoidal function ^ The superscript tilde in 
equation (p) indicates that the sum is not taken over j S {i — + 1}, nor over residues that are spaced more than 
lOA apart. The latter constraint is used to avoid the possibility that an intervening residue is between sites i and j. 
The residues which have a contact score greater than 5 were chosen to belong to the H class while the others were 
considered polar (P). Finally, the obtained string was compared with the true (HP-coarse grained) protein sequence. 
It turned out that, on average, the design procedure identified the correct class of amino acid residues 73% of the 
time. This very simple design strategy matches the performance of previous design attempts which were based on 
more complex algorithms where, for example, the polarity of residues was assigned according to the exposed area of 
their Van der Waals spheres (Sun et al. |2^. 

The detailed success rate on each protein in our chosen set is given in Fig. ^ Fig. ^, on the other hand, shows the 
frequency with which a given amino acid was assigned to the wrong HP class. These results, which to our knowledge 
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were not considered in previous studies, provide useful insight for the design failure. For example, from Fig. it 
appears that the class identification failure is highest for alanine, for which it is slightly higher than 50%. This 
can probably be ascribed to the fact that the sites occupied by alanine are determined more on the basis of steric 
interactions than hydrophobic ones. The failure of identilying alanine as hydrophobic seriously affects the overall 
design success rate; in fact, nearly 10% of proteins residues are of this type (see Table O). 



B. Method 2 



It is perhaps surprising that the previous design method can yield a success rate of 73% especially in view of the 
fact that it only uses local geometrical information about the target structure, F. In this subsection, we go beyond 
this approximation and take as the solution to the design problem, the sequence, S minimizing the following form 
(see also Sun et al. |2^) for K{T, S) appearing in (|l|), 

where Si = [Si — 1] if residue i is of type P [H], e is the contact energy matrix, f{rij) is the sigmoidal contact-strength 
function, 

r (l + e'^--6.5)-i ifj^{i_i,i,i + i}andr,, <10A, , . 

•^*^*^''~\0 otherwise, ^' 

and /I is a positive quantity. The contact matrix, e is chosen to be symmetric (i.e., e(0, 1) = e(l,0)); moreover the 
energy scale is set so that e(l, 1) = —1. The second term in (^) does not depend on the target structure, F, and 
is to be regarded as an approximate expression for the free-energy , F{S), whose effect is to control the ratio of 
H to P residues in designed sequences. This particular approximation for F was inspired by the fact that, on HP 
lattice models, sequences with the same number of H and P residues have nearly the same free-energy |lj. The 
introduction of the "chemical potential", /i, appears the simplest way of sifting through the sequence space to retain 
only sequences with the designed H-P ratio. Among this subset, the putative solution to the design problem will be 
the sequence minimizing the contact energy term. On the basis of the success of an analogous HP design strategy for 
three-dimensional lattice structures |2|, it can be expected that the sequence minimizing (jj) will be close to the true 
protein sequence. 

The quantities e(l, 0), e(0, 0) and /i appearing in (^ are regarded as parameters to be optimized in order to maximize 
the overall design success rate. This step can be viewed as a way of extracting the HP contact energies from proteins 
of known sequence and conformation. 

The design procedure engine consists of the following two steps: 

1. for a given set of parameters the sequence S minimizing (^ is identified with a simulated annealing procedure 
(the elementary move being the mutation of a fraction of residues from one class to the other) , 

2. the parameters are varied, and step 1 is repeated, in the attempt to identify the set of values giving the highest 
average design success rate. 

The highest success rate was found for 

e(0, 1) = 0.006025 , e(0, 0) = 1.481606 , /i = 6.1909 (6) 

and was equal to 73.4% as shown in Fig. ||. 

Figure ^ represents the ribbon plot of protein 3rn3, where the design success was 73.3%. Helices and /? — sheets are 
coloured in purple and yellow respectively, while the black portions mark the residues whose hydrophobicity/polarity 
was not recognized correctly by the design method. 

Remarkably, despite the increased computational effort, the overall success rate has not improved appreciably over 
the previous design attempt. In particular it can be noted that, while the failure rate over the individual files is 
roughly the same as in Fig. |^, Figs. ^ and || differ significantly. The average failure in identifying P-type residues has 
decreased significantly while the failure rate for alanine has grown from 50% to 60%. As mentioned before this can 
be ascribed to the small volume of alanine which favours its location in protein structures on the basis of excluded 
volume reasons. A high failure rate also affects mctionine, which is a common helix-former. Hence, it appears that 
the highest failure rate is found among residues whose location in the protein is dictated by specific functionality 
rather than mere energetic considerations. 
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C. Method 3 



In a third attempt we tried to improve the approximation for the free energy term, F{S), used in A possible 
way of constructing approximate forms for F{S) is to include in it all possible constraints that are satisfied by real 
protein sequences. One of these constraints is that, using the coarse-graining scheme of Tabic |l[ the ratio oi H to P 
residues must be. 



0.71 



(7) 



The previous method tried to tune in to the right value of r by optimizing the "chemical potential" , ^. 

By studying the statistical properties of protein chains we have been able to identify another constraint which, to 
our knowledge, had not been previously identified. In fact, we have established that the number of H segments, S//, 
in a variety of proteins grows linearly with the length, L, of the chain. As can be seen in Fig. ^ the linear behaviour 
is marked and, in fact, the linear correlation coefficient over our set of 20 proteins was equal to 0.964. 
A linear regression of the points in Fig. ^ gives the following equation for the interpolating line: 



= 0.547889 + 0.252676 • L 



(8) 



For chains of length L « 130 the difference between the true value of (« 35) and the one estimated with (||) was of 
the order of 3 units. Given the high degree of reliablity of constraint (^) we decided to incorporate it in our expression 
for F{S). In fact, it turned out that the sequences designed with the previous method had rather low values of S//; 
in other words the H residues tended to cluster together in relatively long segments. 

Our expectation was that the simultaneous requirement that design sequences should obey both (Q) and (|^) would 
be an efficient sieve for isolating good sequences. Hence we adopted the following form for K{r, S), 

i^r(r,^)=5]e(5„^,)/(r,,) + 



V2{nHiS)-0A15- LY' 



(9) 



where L, Nh{S) and T,h{S) are, respectively, the length of the chain, the number of H residues in 5* and the number 
of H segments in S. The term in square bracket in equation can be regarded as an expansion of the free energy, 
F(S) (see eqn. (^), to fourth order in the spin variables, 5*^. In fact, it can be equivalently recast into, 



L-l r 

E 



{Si — Si+iY + So + Sl — Tjh{L) 



t—1 ^ ^ 



(10) 



In equation (^, the amplitudes of the two potential wells, V\ and V2, were chosen to be of the order of 100 
(the energy unit is |e(l,l)| = 1). By using the strategy described in subsection B we obtained the best results for 
e(l,0) = 0.285509 and e(0, 0) = 1.520444, for which the design success rate was 73.1%, slightly lower than for the 
previous attempt. The detailed results are summarized in Figs. |^ and |^. Once more, the failure rates on individual 
amino acids differ appreciably from previous attempts (the amino acid failure rates are, however, robust against small 
changes in the optimized parameters). 

An artificial way of increasing the design success would be to include alanine in the class of polar residues. This 
would make the success rate of method 1 grow to 73.7% . For the second method the success increase is, a priori 
not as easy to estimate. In fact, it can be expected that changing the class of alanine could trigger a cascade effect 
modifying the overall success score. 

Surprisingly this was not the case; by changing alanine from H to P the best overall success grew to 75.1%. This is 
almost exactly the value that one would have obtained by taking, for alanine, the complementary of the failure rate 
in Fig. ^, suggesting an apparently weak correlation between the position of alanine and other residues. 



D. Structural homology 

We complete the discussion of our design attempts by considering the case of protein homology. It is well-known 
, ^, ^ that there is no strict one-to-one correspondence between structures and sequences; indeed, it is often the case 
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that native states of proteins whose mutual identity is around 80% fold into nearly the same structure (the RMSD 
being typically less than 0.8 A per residue). In this case the two proteins are said to be homologous. A thorough 
check of a design procedure should then allow for the possibility that the designed sequence, S' , is homologous to the 
target one, S. This can conveniently be done by comparing S" with all the known sequences homologous to S. 



Table ID lists, for each of our 20 target structures, the names of proteins with the same length and high structural 
identity (this includes, of course, the true target sequence). It can be noticed that, in a number of cases, it was not 
possible to identify any homologous sequence. 

We carried out the same design procedure as in method 3, but with the following proviso: in case a target structure 



is associated with more than one sequence in Table III, we measure the design success against each sequence in this 
set and then take the highest value as the design score. 

It turned out that, for nearly all cases, the highest sequence identity was attained with the true target sequence, 
not the homologous ones. For this reason the overall success was 73.4%, very close to the previous result. 

The inability to improve significantly the success score over the four attempts discussed above is, possibly, suggestive 
that important features of real proteins have been mistreated. 

One possibility (K. A. Dill, private communication) is that naturally occurring proteins may not have necessarily 
evolved to maximize the occupation probability (^. According to this point of view, designed sequences may differ 
from natural ones because they have a higher thermodynamic stability. 

Another possible explanation for our inability to correctly recognize as many as 25% of the residues may be the 
coarse graining of the 20 types of amino acids into just two classes, H and P. In the following section we will try to 
explore this possibility by studying a lattice model for proteins. Indeed it will appear that the HP coarse graining 
seriously limits the maximum achievable success rate. 

III. DISADVANTAGES OF THE HP COARSE-GRAINING 

In this section we want to highlight the fact that an H-P coarse-graining can partially mislead the design procedure. 
To substantiate this claim we perform a lattice test in which proteins are schematically represented by self-avoiding 
walks on a square lattice. The amino acids are located at the sites touched by the chains and they are divided in four 
types, according to their different degree of hydrophobicity or polarity: Hi, H2, Pi, -P2. Two non-consecutive amino 
acids will be said to be in contact if they are spaced only one lattice unit apart. The contact potentials, u{ai,aj) 
(where a G {Hi, H2, Pi, P2}) between two interacting sites were chosen to be 



/ -2 --2 -2 -1 \ 
/ J J _? _? ^ 

J _i J J 

vli J -1 -! / 

\ 2 3 8 4 / 



(11) 



The entries of the matrix u were chosen so that the residues Hi and H2 are energetically favoured to stay in the interior 
of the protein, unlike Pi and P2. We consider chains made of 12 residues, a case where an exhaustive enumeration 
of all possible conformations can be carried out with modest numerical effort. Thus, for a given sequence of amino 
acids S = {(Ti, (72, ...,0-12}, we can verify if it has a unique ground state (i.e. if it is a stable protein) or not. We then 
generated and stored a set of 1000 sequences Si, S2, S'looo with unique ground state conformations, T{Si). Our aim 
was to verify whether a coarse graining of the original 4 types of amino acids into just two HP classes still allows a 
correct design over the 1000 conformations. As a first attempt, we gathered the amino acids Hi and H2 in class H 
and Pi, P2 in class P and introduced a new set of interaction potentials, u{<7i, Uj) between these two new classes by 
averaging out the corresponding 2x2 H-P blocks in the matrix (O) , 



-3 _i \ 

? _§ . (12) 

2 8 / 

Then we transformed each sequence Si in the corresponding coarse grained form, Si and by means of exact enumer- 
ation we tried to verify if Si still had a unique state of lowest energy (adopting the coarse grained interactions) on 
conformation T{Si). Disappointingly, only in 718 cases was the answer affirmative. A case where the procedure fails 
is shown in Fig. ||. 

To make our test more stringent, we repeated the procedure by setting 

u{H,H) = -l, (13) 
u{H,P) = u{H,P) = z (14) 
{H,H)^y (15) 



u 
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and optimally varied y and z in order to maximize the number of ground states correctly predicted. We have found 
an optimal solution for z = y = 0.86 for which 830 of the original 1000 conformations were identified as native states 
of the coarse-grained sequences. 

Although one cannot extend the same quantitative study of the effects of the HP coarse graining on real protein 
design, the results given above suggest that not only they can be non negligible, but they can significantly reduce the 
maximum design success that can be achieved. 



IV. SUMMARY 



To summarize we have presented several protein design methods for identifying the correct hydrophobic or polar 
(H/P) class of residues on a set of 20 proteins. The simplest of these techniques, which is readily implemented and 
takes nearly no CPU time, gave a success rate of 73%. More sophisticated methods, encompassing negative design 
features, failed to improve appreciably upon the previous attempt. 

We have presented evidence showing that the highest failure rate was found on residues with high secondary-structure 
functionality. On the basis of numerical results on exactly solvable models, it was also suggested that the HP coarse- 
graining may significantly impair the design algorithms, thus possibly accounting for the failure to increase the design 
success score by using algorithms of growing complexity. 
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Label 


PDB File 


Len 


1 


Ibba 


36 


2 


Ibbl 


37 


3 


3ebx 


62 


4 


laba 


87 


5 


2hpr 


87 


6 


laps 


98 


7 


laaj 


105 


8 


lerv 


105 


9 


lycc 


108 


10 


5cpv 


108 


11 


3rn3 


124 


12 


Ihel 


129 


13 


lifb 


131 


14 


lecd 


136 


15 


losa 


148 


16 


Imbd 


153 


17 


lra8 


159 


18 


1192 


162 


19 


21zm 


164 


20 


9pap 


212 



TABLE L The names of the 20 proteins taken from the Protein Data Bank and used in out design studies are given in 
column two. The number of amino acids in each protein is given in the last column. 



Amino Acid 


Type 


Frcq. (%) 


ALA 


H 


8.85 


VAL 


H 


6.59 


LEU 


H 


6.85 


ILE 


H 


5.36 


CYS 


H 


2.04 


MET 


H 


2.42 


PHE 


H 


4.38 


TYR 


H 


3.32 


TRP 


H 


1.40 


GLY 


P 


7.91 


PRO 


P 


3.23 


HIS 


P 


2.17 


SER 


P 


5.70 


THR 


P 


5.87 


LYS 


P 


8.08 


ARG 


P 


4.64 


ASP 


P 


5.96 


ASN 


P 


5.15 


GLU 


P 


6.76 


GLN 


P 


3.32 



TABLE II. Following Sun et al. ^ we distributed the 20 amino acids in H/P classes as in the second column. The third 
column gives the relative frequency of a given amino acid in the 20 proteins of Table B. 
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Target structure 


Homologous Sequences 


ibba 


Ibba 


Ibbl 


Ibbl 


3ebx 


Sebx, 5ebx, Inxb 


laba 


laba 


2hpr 


2hpr 


laps 


laps 


laaj 


laaj 


lerv 


lerv, leru, Strx, Itrw 


lycc 


lycc, Icsu, Iraq, Icrj, Icig 


5cpv 


5cpv 


3rn3 


3rn3 


Ihel 


Ihel 


lifb 


lifb 


Iced 


Iced 


losa 


losa, 3cln, 4cln, IcU 


Imbd 


Imbd 


IraS 


IraS, Ijom 


1192 


1192, 71zm, 1174, 1172, 1164 


21zm 


21zm 


9pap 


9pap 



TABLE III. For each protein in the original set (first column) we identified, when possible, a set of proteins with high 
structural identity (second column). The outcome of our design procedure was then checked against the set of structurally 
homologous sequences. 
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Attempt 1 
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Weighted success average: 73.0% 
Percentage of H residues: 35.7 
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FIG. 1. Histogram of tiie design success rates on each of the 20 proteins of Table | obtained using method 1. 
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FIG. 2. Histogram of the failure rates in identifying the correct H/P class of the 20 amino acids obtained with method 1. 
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Attempt 2 




FIG. 3. Histogram of the design success rates on each of the 20 proteins of Table | obtained using method 2. 




FIG. 4. Ribbon plot of protein 3rn3. The black sections highlight residues which were wrongly assigned by the second design 
procedure. 
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Attempt 2 
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FIG. 5. Histogram of the failure rates in identifying the correct H/P class of the 20 amino acids obtained with method 2. 
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FIG. 6. Plot of the number of H segments, T.h for our set of 20 proteins as a function of the protein length, L. The solid 
curve is the interpolating line (see equation Bl). 
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Attempt 3 
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Weighted success average: 73.1% 
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FIG. 7. Histogram of the design success rates on each of the 20 proteins of Table | obtained using method 3. 
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FIG. 8. Histogram of the failure rates in identifying the correct H/P class of the 20 amino acids obtained with method 3. 
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FIG. 9. a) The unique ground state conformation for sequence S = {Pi,Hi,Pi,H2,Hi,P2,Hi,Hi,Hi,P2,H2,P2}- The 
ground state energy of S is -8. The HP coarse graining of S yields a sequence, S = {P, H, P, H, H, P, H, H, H, P, H, P} which 
has energy -6.5 on conformation (a). The true ground state of S has energy -6.875 and is represented in (b). 
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